12 research outputs found

    Towards Sensorimotor Coupling of a Spiking Neural Network and Deep Reinforcement Learning for Robotics Application

    Get PDF
    Deep reinforcement learning augments the reinforcement learning framework by utilizing the powerful representations of deep neural networks. Recent works have demonstrated the great achievements of deep reinforcement learning in various domains, including finance, medicine, healthcare, video games, robotics, and computer vision. Neural networks started with the multi-layer perceptron (the first generation), developed into deep neural networks (the second generation), and are now moving toward spiking neural networks, known as the third generation of neural networks. Spiking neural networks aim to bridge the gap between neuroscience and machine learning, using biologically realistic models of neurons to carry out computation. In this thesis, we first provide a comprehensive review of both spiking neural networks and deep reinforcement learning, with an emphasis on robotic applications. We then demonstrate how to develop a robotics application for context-aware scene understanding that performs sensorimotor coupling. Our system contains two modules, corresponding to scene understanding and robotic navigation. The first module is implemented as a spiking neural network that carries out semantic segmentation to understand the scene in front of the robot. The second module provides high-level navigation commands to the robot, which is treated as an agent and implemented with online reinforcement learning. This module uses a biologically plausible local learning rule that allows the agent to adapt quickly to the environment. To benchmark our system, we tested the first module on the Oxford-IIIT Pet dataset and the second module on a custom-made Gym environment. Our experimental results show that the system achieves results competitive with deep neural networks on the segmentation task and adapts quickly to the environment.
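
    A minimal sketch of how such a two-module pipeline could be wired together, assuming a toy leaky integrate-and-fire (LIF) rate code standing in for the segmentation network and a reward-modulated, REINFORCE-style local update for the navigation policy. The function names (lif_segment, local_update), shapes, and constants are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch (not the thesis code): a toy two-module pipeline in NumPy.
# Module 1: a single leaky integrate-and-fire (LIF) layer rate-codes a frame into a
# coarse per-pixel map (standing in for semantic segmentation).
# Module 2: a linear softmax policy over that map picks a discrete navigation command
# and is updated online with a reward-modulated, local (REINFORCE-like) rule.
import numpy as np

rng = np.random.default_rng(0)

def lif_segment(frame, steps=20, tau=0.9, v_th=1.0):
    """Rate-code the input frame into per-pixel spike rates via LIF dynamics."""
    v = np.zeros_like(frame)
    spikes = np.zeros_like(frame)
    for _ in range(steps):
        v = tau * v + frame          # leaky integration of the input current
        fired = v >= v_th            # threshold crossing
        spikes += fired
        v[fired] = 0.0               # reset membrane potential after a spike
    return spikes / steps            # spike rate in [0, 1] per pixel

N_ACTIONS = 3                                    # e.g. turn-left, go-straight, turn-right
W = rng.normal(0.0, 0.1, size=(N_ACTIONS, 64))   # policy weights over the flattened map

def act(seg_map):
    x = seg_map.reshape(-1)                      # 8x8 rate map -> 64-dim feature
    logits = W @ x
    p = np.exp(logits - logits.max()); p /= p.sum()
    a = rng.choice(N_ACTIONS, p=p)
    return a, x, p

def local_update(a, x, p, reward, lr=0.05):
    """Reward-modulated local rule: (onehot(a) - p) outer x, scaled by the reward."""
    grad = -p[:, None] * x[None, :]              # push all actions down by their probability
    grad[a] += x                                 # ...and the taken action back up
    return lr * reward * grad

# One toy interaction step
frame = rng.random((8, 8))
seg = lif_segment(frame)
a, x, p = act(seg)
reward = 1.0 if a == 1 else -0.1                 # pretend "go straight" was the right call
W += local_update(a, x, p, reward)
print("action:", a, "reward:", reward)
```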

    Towards Multi-modal Explainable Video Understanding

    Get PDF
    This thesis presents a novel approach to video understanding by emulating human perceptual processes and creating an explainable and coherent storytelling representation of video content. Central to this approach is the development of a Visual-Linguistic (VL) feature for an interpretable video representation and the creation of a Transformer-in-Transformer (TinT) decoder for modeling intra- and inter-event coherence in a video. Drawing inspiration from the way humans comprehend scenes by breaking them down into visual and non-visual components, the proposed VL feature models a scene through three distinct modalities. These include: (i) a global visual environment, providing a broad contextual understanding of the scene; (ii) local visual main agents, focusing on key elements or entities in the video; and (iii) linguistic scene elements, incorporating semantically relevant language-based information for a comprehensive understanding of the scene. By integrating these multimodal features, the VL representation offers a rich, diverse, and interpretable view of video content, effectively bridging the gap between visual perception and linguistic description. To ensure the temporal coherence and narrative structure of the video content, we introduce an autoregressive Transformer-in-Transformer (TinT) decoder. The TinT design consists of a nested architecture where the inner transformer models the intra-event coherency, capturing the semantic connections within individual events, while the outer transformer models the inter-event coherency, identifying the relationships and transitions between different events. This dual-layer transformer structure facilitates the generation of accurate and meaningful video descriptions that reflect the chronological and causal links in the video content. Another crucial aspect of this work is the introduction of a novel VL contrastive loss function. This function plays an essential role in ensuring that the learned embedding features are semantically consistent with the video captions. By aligning the embeddings with the ground truth captions, the VL contrastive loss function enhances the model's performance and contributes to the quality of the generated descriptions. The efficacy of our proposed methods is validated through comprehensive experiments on popular video understanding benchmarks. The results demonstrate superior performance in terms of both the accuracy and diversity of the generated captions, highlighting the potential of our approach in advancing the field of video understanding. In conclusion, this thesis provides a promising pathway toward building explainable video understanding models. By emulating human perception processes, leveraging multimodal features, and incorporating a nested transformer design, we contribute a new perspective to the field, paving the way for more advanced and intuitive video understanding systems in the future.
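
    As a concrete illustration of the nested design, here is a minimal sketch assuming batch-first PyTorch transformer encoders: the inner encoder attends within each event's token sequence (intra-event coherence) and the outer encoder attends across pooled per-event summaries (inter-event coherence). The class name TinTSketch, the mean pooling, and all dimensions are assumptions; the actual decoder is autoregressive and conditions on caption tokens, which is omitted here for brevity.

```python
# Hedged sketch (not the authors' code): a nested Transformer-in-Transformer layout.
import torch
import torch.nn as nn

class TinTSketch(nn.Module):
    def __init__(self, d_model=256, nhead=4, inner_layers=2, outer_layers=2):
        super().__init__()
        inner = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        outer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.inner = nn.TransformerEncoder(inner, inner_layers)   # intra-event coherence
        self.outer = nn.TransformerEncoder(outer, outer_layers)   # inter-event coherence

    def forward(self, events):
        # events: (batch, num_events, tokens_per_event, d_model) visual-linguistic features
        b, e, t, d = events.shape
        intra = self.inner(events.reshape(b * e, t, d))           # attend within each event
        summaries = intra.mean(dim=1).reshape(b, e, d)            # one pooled vector per event
        inter = self.outer(summaries)                             # attend across events
        return intra.reshape(b, e, t, d), inter

vl_feats = torch.randn(2, 4, 16, 256)    # 2 videos, 4 events, 16 tokens per event
intra, inter = TinTSketch()(vl_feats)
print(intra.shape, inter.shape)          # (2, 4, 16, 256) and (2, 4, 256)
```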

    CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

    Full text link
    Video anomaly detection (VAD) -- commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to the labor-intensive nature of frame-level annotation -- is a challenging problem in video surveillance, where anomalous frames need to be localized in an untrimmed video. In this paper, we first propose to utilize ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features used in the domain, to efficiently extract discriminative representations. We then model long- and short-range temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study conducted on each component confirms its effectiveness, and extensive experiments show that our proposed CLIP-TSA outperforms existing state-of-the-art (SOTA) methods by a large margin on two commonly used benchmark datasets for VAD (UCF-Crime and ShanghaiTech Campus). The source code will be made publicly available upon acceptance. Comment: Under Submission.
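
    A minimal sketch of the weak-supervision setup, assuming pre-extracted CLIP ViT snippet features, a single self-attention block as a stand-in for the temporal modeling, and a top-k multiple-instance-learning objective trained only on video-level labels. The feature dimension, k, and the BCE-on-top-k formulation are assumptions, not the paper's exact recipe.

```python
# Hedged sketch (not the CLIP-TSA implementation): MIL anomaly scoring over CLIP features.
import torch
import torch.nn as nn

class SnippetScorer(nn.Module):
    def __init__(self, d=512, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)  # temporal self-attention
        self.head = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feats):                    # feats: (batch, snippets, d) CLIP features
        ctx, _ = self.attn(feats, feats, feats)  # each snippet attends over the whole video
        return torch.sigmoid(self.head(ctx)).squeeze(-1)  # per-snippet anomaly score

def mil_topk_loss(scores, video_labels, k=3):
    """Video-level supervision only: the mean top-k snippet score should match the label."""
    topk = scores.topk(k, dim=1).values.mean(dim=1)
    return nn.functional.binary_cross_entropy(topk, video_labels.float())

feats = torch.randn(4, 32, 512)                  # 4 videos x 32 snippets x CLIP feature dim
labels = torch.tensor([1, 0, 1, 0])              # 1 = the video contains an anomaly somewhere
scores = SnippetScorer()(feats)
loss = mil_topk_loss(scores, labels)
loss.backward()
print(scores.shape, float(loss))
```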

    VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

    Full text link
    Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling. Following the human perception process, where a scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in terms of accuracy and diversity. Comment: Accepted to AAAI 2023.
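
    The VL contrastive objective can be pictured as a standard symmetric InfoNCE loss between pooled video embeddings and caption embeddings; the sketch below is an assumption about its general form, not the exact VLTinT loss, and the temperature value is illustrative.

```python
# Hedged sketch: a symmetric InfoNCE-style contrastive loss between video and caption embeddings.
import torch
import torch.nn.functional as F

def vl_contrastive_loss(vl_emb, cap_emb, temperature=0.07):
    # vl_emb, cap_emb: (batch, d); row i of each is assumed to describe the same video
    vl = F.normalize(vl_emb, dim=-1)
    cap = F.normalize(cap_emb, dim=-1)
    logits = vl @ cap.t() / temperature              # cosine-similarity matrix
    targets = torch.arange(vl.size(0))               # the i-th video matches the i-th caption
    # symmetric: video-to-caption and caption-to-video directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

vl = torch.randn(8, 256, requires_grad=True)
cap = torch.randn(8, 256)
print(float(vl_contrastive_loss(vl, cap)))
```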

    Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

    Full text link
    Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusio
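
    A minimal sketch of the fusion and query idea, using random stand-in embeddings: per-frame region embeddings are matched to embeddings already stored for the 3D map via the Hungarian algorithm, fused by a running average, and an open-vocabulary query is answered by cosine similarity against a text embedding. The threshold, the running-average fusion, and all array shapes are illustrative assumptions; the TSDF geometry and the actual vision-language foundation model are out of scope here.

```python
# Hedged sketch (not the Open-Fusion code): Hungarian matching + running-average feature fusion.
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

scene_emb = np.random.randn(5, 512)                       # embeddings stored in the 3D map
scene_cnt = np.ones(5)                                    # observations fused per map object

frame_emb = scene_emb[[0, 2, 3, 4]] + 0.1 * np.random.randn(4, 512)  # noisy re-observations
cost = 1.0 - cosine(frame_emb, scene_emb)                 # low cost = similar region/object
rows, cols = linear_sum_assignment(cost)                  # Hungarian assignment

for r, c in zip(rows, cols):
    if cost[r, c] < 0.7:                                  # accept only sufficiently similar pairs
        # running-average fusion of the matched region into the map object's embedding
        scene_emb[c] = (scene_emb[c] * scene_cnt[c] + frame_emb[r]) / (scene_cnt[c] + 1)
        scene_cnt[c] += 1
    # else: a full system would spawn a new map object for the unmatched region

text_emb = np.random.randn(512)                           # stand-in for an encoded text query
scores = cosine(scene_emb, text_emb[None, :]).ravel()
print("best-matching map object for the query:", int(scores.argmax()))
```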

    Spiking Neural Networks and Their Applications: A Review

    No full text
    The past decade has witnessed the great success of deep neural networks in various domains. However, deep neural networks are very resource-intensive in terms of energy consumption, data requirements, and high computational costs. With the recent increasing need for the autonomy of machines in the real world, e.g., self-driving vehicles, drones, and collaborative robots, exploitation of deep neural networks in those applications has been actively investigated. In those applications, energy and computational efficiencies are especially important because of the need for real-time responses and the limited energy supply. A promising solution to these previously infeasible applications has recently been given by biologically plausible spiking neural networks. Spiking neural networks aim to bridge the gap between neuroscience and machine learning, using biologically realistic models of neurons to carry out the computation. Due to their functional similarity to the biological neural network, spiking neural networks can embrace the sparsity found in biology and are highly compatible with temporal codes. Our contributions in this work are: (i) we give a comprehensive review of theories of biological neurons; (ii) we present various existing spike-based neuron models that have been studied in neuroscience; (iii) we detail synapse models; (iv) we provide a review of artificial neural networks; (v) we provide detailed guidance on how to train spike-based neuron models; (vi) we review available spike-based frameworks that have been developed to support the implementation of spiking neural networks; (vii) finally, we cover existing spiking neural network applications in the computer vision and robotics domains. The paper concludes with discussions of future perspectives.
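
    As a small concrete example of the synapse models the review covers, the sketch below implements a pair-based spike-timing-dependent plasticity (STDP) update: pre-before-post spike pairs potentiate a weight, post-before-pre pairs depress it, with an exponential dependence on the timing gap. The constants and spike times are illustrative, not taken from the paper.

```python
# Hedged sketch: pair-based STDP weight update for a single synapse.
import numpy as np

def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt >= 0:                              # pre fired before post -> potentiate
        return a_plus * np.exp(-dt / tau)
    return -a_minus * np.exp(dt / tau)       # post fired before pre -> depress

w = 0.5
pre_spikes = [10.0, 40.0, 80.0]              # ms
post_spikes = [12.0, 35.0, 95.0]             # ms
for t_pre, t_post in zip(pre_spikes, post_spikes):
    w = np.clip(w + stdp_dw(t_pre, t_post), 0.0, 1.0)
print(f"synaptic weight after STDP updates: {w:.4f}")
```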
